
    The Emergence of Norms via Contextual Agreements in Open Societies

    This paper explores the emergence of norms in agent societies when agents simultaneously play multiple, even incompatible, roles in their social contexts and have limited interaction ranges. Specifically, this article proposes two reinforcement learning methods for agents to compute agreements on strategies for using common resources to perform joint tasks. The computation of norms that takes into account agents playing multiple roles in their social contexts has not been studied before. To make the problem even more realistic for open societies, we do not assume that agents share knowledge of their common resources, so they have to compute semantic agreements towards performing their joint actions. The paper reports on an empirical study of whether and how efficiently societies of agents converge to norms, exploring the proposed social learning processes with respect to different society sizes and the ways agents are connected. The results reported are very encouraging regarding the speed of the learning process as well as the convergence rate, even in quite complex settings.
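
    Purely as an illustration of the kind of social learning involved (a toy sketch, not the paper's two methods), the following lets two independent Q-learning agents repeatedly choose between two hypothetical resource-usage strategies and rewards them only when their choices agree; under mild exploration they typically converge on a common choice, i.e., an emergent norm.

        import random

        ACTIONS = ["strategy_A", "strategy_B"]   # hypothetical strategies for a shared resource
        ALPHA, EPSILON, EPISODES = 0.1, 0.1, 2000

        q = [{a: 0.0 for a in ACTIONS} for _ in range(2)]   # one Q-table per agent

        def choose(table):
            if random.random() < EPSILON:        # occasional exploration
                return random.choice(ACTIONS)
            return max(table, key=table.get)     # otherwise exploit the current estimate

        for _ in range(EPISODES):
            picks = [choose(q[i]) for i in range(2)]
            reward = 1.0 if picks[0] == picks[1] else 0.0   # only agreement is rewarded
            for i in range(2):
                q[i][picks[i]] += ALPHA * (reward - q[i][picks[i]])

        # both agents typically end up preferring the same strategy
        print([max(table, key=table.get) for table in q])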

    Probabilistic Timed Automata with Clock-Dependent Probabilities

    Probabilistic timed automata are classical timed automata extended with discrete probability distributions over edges. We introduce clock-dependent probabilistic timed automata, a variant of probabilistic timed automata in which transition probabilities can depend linearly on clock values. Clock-dependent probabilistic timed automata allow the modelling of a continuous relationship between time passage and the likelihood of system events. We show that the problem of deciding whether the maximum probability of reaching a certain location is above a threshold is undecidable for clock-dependent probabilistic timed automata. On the other hand, we show that the maximum and minimum probability of reaching a certain location in clock-dependent probabilistic timed automata can be approximated using a region-graph-based approach. Comment: Full version of a paper published at RP 201
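
    As a toy illustration of clock-dependent probabilities and their grid-style approximation (my own example with assumed numbers, not the paper's region-graph construction), consider one clock and two consecutive edges whose success probabilities are linear in the clock value: the first edge, fired at time x, succeeds with probability 0.3 + x/4, and the second, fired at time y >= x before the deadline 2, succeeds with probability (2 - y)/2. Discretizing the clock values and maximizing over the grid approximates the maximum probability of reaching the goal.

        def p_first(x):
            return 0.3 + x / 4.0       # linear clock-dependent probability of the first edge

        def p_second(y):
            return (2.0 - y) / 2.0     # linear clock-dependent probability of the second edge

        def max_reach_prob(steps):
            grid = [2.0 * i / steps for i in range(steps + 1)]   # discretized clock values in [0, 2]
            return max(
                p_first(x) * p_second(y)
                for x in grid
                for y in grid
                if y >= x
            )

        for steps in (4, 16, 64, 256):
            print(steps, max_reach_prob(steps))   # approaches the true optimum 0.32 at x = y = 0.4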

    The Impatient May Use Limited Optimism to Minimize Regret

    Discounted-sum games provide a formal model for the study of reinforcement learning, where the agent is enticed to get rewards early since later rewards are discounted. When the agent interacts with the environment, she may regret her actions, realizing that a previous choice was suboptimal given the behavior of the environment. The main contribution of this paper is a PSPACE algorithm for computing the minimum possible regret of a given game. To this end, several results of independent interest are shown. (1) We identify a class of regret-minimizing and admissible strategies that first assume that the environment is collaborating, then assume it is adversarial; the precise timing of the switch is key here. (2) Disregarding the computational cost of numerical analysis, we provide an NP algorithm that checks whether the regret entailed by a given time-switching strategy exceeds a given value. (3) We show that determining whether a strategy minimizes regret is decidable in PSPACE.
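
    To make the regret notion concrete (a one-shot matrix abstraction with made-up payoffs, not the paper's PSPACE algorithm for discounted-sum games), the sketch below fixes an agent action, lets the environment respond, and measures regret as the payoff the agent could have obtained with hindsight minus what she actually obtained; the regret-minimizing action minimizes this worst-case gap.

        # payoff[a][b]: discounted-sum payoff when the agent plays a and the environment plays b
        payoff = {
            "safe":  {"friendly": 4, "hostile": 3},
            "risky": {"friendly": 6, "hostile": 0},
        }

        def regret(action):
            # worst case over environment behaviours of "hindsight-best payoff minus achieved payoff"
            return max(
                max(payoff[a][b] for a in payoff) - payoff[action][b]
                for b in ("friendly", "hostile")
            )

        print({a: regret(a) for a in payoff})   # {'safe': 2, 'risky': 3}
        print(min(payoff, key=regret))          # 'safe' minimizes the worst-case regret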

    The Complexity of Graph-Based Reductions for Reachability in Markov Decision Processes

    We study the never-worse relation (NWR) for Markov decision processes with an infinite-horizon reachability objective. A state q is never worse than a state p if the maximal probability of reaching the target set of states from p is at most the same value from q, regardless of the probabilities labelling the transitions. Extremal-probability states, end components, and essential states are all special cases of the equivalence relation induced by the NWR. Using the NWR, states in the same equivalence class can be collapsed. Then, actions leading to suboptimal states can be removed. We show that the natural decision problem associated with computing the NWR is coNP-complete. Finally, we extend a previously known incomplete polynomial-time iterative algorithm to under-approximate the NWR.
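
    The sketch below only illustrates the quantity being compared, not the coNP-complete check itself: for one fixed labelling of the transition probabilities (made up for the example), value iteration computes the maximal probability of reaching the target from each state, and two states are compared. Since the NWR requires the comparison to hold for every labelling, a single labelling yields only a necessary condition.

        # mdp[state][action] = list of (probability, successor)
        mdp = {
            "p": {"a": [(0.5, "goal"), (0.5, "sink")]},
            "q": {"a": [(1.0, "goal")], "b": [(1.0, "sink")]},
            "goal": {}, "sink": {},
        }
        TARGET = {"goal"}

        def max_reach(mdp, target, iters=200):
            val = {s: (1.0 if s in target else 0.0) for s in mdp}
            for _ in range(iters):
                for s, acts in mdp.items():
                    if s in target or not acts:
                        continue
                    val[s] = max(sum(pr * val[t] for pr, t in succ) for succ in acts.values())
            return val

        v = max_reach(mdp, TARGET)
        print(v["q"] >= v["p"])   # True for this labelling: necessary, not sufficient, for the NWR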

    Approximating Euclidean by Imprecise Markov Decision Processes

    Euclidean Markov decision processes are a powerful tool for modeling control problems under uncertainty over continuous domains. Finite-state imprecise Markov decision processes can be used to approximate the behavior of these infinite models. In this paper we address two questions. First, we investigate what kind of approximation guarantees are obtained when the Euclidean process is approximated by finite-state approximations induced by increasingly fine partitions of the continuous state space. We show that for cost functions over finite time horizons the approximations become arbitrarily precise. Second, we use imprecise Markov decision process approximations as a tool to analyse and validate cost functions and strategies obtained by reinforcement learning. We find that, on the one hand, our new theoretical results validate basic design choices of a previously proposed reinforcement learning approach. On the other hand, the imprecise Markov decision process approximations reveal some inaccuracies in the learned cost functions.
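
    The following sketch (with hand-made intervals, not the partition-induced construction of the paper) shows how an imprecise MDP is used as an abstraction: each abstract state has successors whose probabilities are only known up to an interval, and finite-horizon value iteration that resolves the intervals optimistically and pessimistically yields lower and upper bounds bracketing the cost of the underlying continuous process.

        # transitions[s] = (succ0, succ1, p0_low, p0_high); the probability of succ1 is 1 - p0
        transitions = {
            "low":  ("low", "high", 0.6, 0.8),
            "high": ("low", "high", 0.2, 0.4),
        }
        cost = {"low": 1.0, "high": 4.0}   # per-step cost of being in an abstract state

        def bounds(horizon):
            lo = {s: 0.0 for s in transitions}
            hi = {s: 0.0 for s in transitions}
            for _ in range(horizon):
                new_lo, new_hi = {}, {}
                for s, (s0, s1, p_low, p_high) in transitions.items():
                    # lower bound: put as much probability as allowed on the cheaper successor
                    p0 = p_high if lo[s0] <= lo[s1] else p_low
                    new_lo[s] = cost[s] + p0 * lo[s0] + (1 - p0) * lo[s1]
                    # upper bound: put as much probability as allowed on the dearer successor
                    p0 = p_low if hi[s0] <= hi[s1] else p_high
                    new_hi[s] = cost[s] + p0 * hi[s0] + (1 - p0) * hi[s1]
                lo, hi = new_lo, new_hi
            return lo, hi

        print(bounds(horizon=10))   # the true finite-horizon cost lies between the two bounds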

    Managing inventory and production capacity in start-up firms

    We consider the problem of managing inventory and production capacity in a start-up manufacturing firm, with the objective of maximising the probability of the firm surviving as well as the more common objective of maximising profit. Using Markov decision process models, we characterise and compare the form of optimal policies under the two objectives. This analysis shows the importance of coordination in the management of inventory and production capacity. The analysis also reveals that a start-up firm seeking to maximise its chance of survival will often choose to keep production capacity significantly below the profit-maximising level for a considerable time. This insight helps us to explain the seemingly cautious policies adopted by a real start-up manufacturing firm.
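
    A heavily simplified toy version of such a model (assumed numbers, not the paper's formulation) can be solved by backward induction: the state is (cash, capacity), each period the firm decides whether to expand capacity, demand is random, and the firm survives if its cash never drops below zero over the horizon; the recursion below maximizes the survival probability.

        import functools

        PRICE, UNIT_COST, EXPANSION_COST, HORIZON = 3, 1, 4, 8
        DEMANDS = [0, 1, 2]          # demand each period, equally likely

        @functools.lru_cache(maxsize=None)
        def survive(cash, capacity, t):
            """Maximal probability of keeping cash non-negative up to the horizon."""
            if cash < 0:
                return 0.0                            # bankrupt: the firm did not survive
            if t == HORIZON:
                return 1.0                            # survived the whole horizon
            best = 0.0
            for expand in (False, True):
                c = cash - (EXPANSION_COST if expand else 0)
                cap = capacity + (1 if expand else 0)
                if c < 0:
                    continue                          # cannot afford the expansion
                p = sum(
                    survive(c + PRICE * min(d, cap) - UNIT_COST * cap, cap, t + 1)
                    for d in DEMANDS
                ) / len(DEMANDS)
                best = max(best, p)
            return best

        print(survive(5, 1, 0))    # survival probability under the optimal expansion policy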

    Mean-Payoff Optimization in Continuous-Time Markov Chains with Parametric Alarms

    Continuous-time Markov chains with alarms (ACTMCs) allow for alarm events that can be non-exponentially distributed. In parametric ACTMCs, the parameters of alarm-event distributions are not given explicitly and can be the subject of parameter synthesis. An algorithm solving the ε-optimal parameter synthesis problem for parametric ACTMCs with long-run average optimization objectives is presented. Our approach is based on a reduction of the problem to finding long-run average optimal strategies in semi-Markov decision processes (semi-MDPs) and a sufficient discretization of the parameter (i.e., action) space. Since the set of actions in the discretized semi-MDP can be very large, a straightforward approach based on explicit action-space construction fails to solve even simple instances of the problem. The presented algorithm uses an enhanced policy iteration on symbolic representations of the action space. The soundness of the algorithm is established for parametric ACTMCs with alarm-event distributions satisfying four mild assumptions that are shown to hold for uniform, Dirac and Weibull distributions in particular, but are satisfied for many other distributions as well. An experimental implementation shows that the symbolic technique substantially improves the efficiency of the synthesis algorithm and makes it possible to solve instances of realistic size. Comment: This article is a full version of a paper accepted to the Conference on Quantitative Evaluation of SysTems (QEST) 201
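
    As a toy illustration of the parameter-discretization idea (my own renewal example with assumed costs, not the paper's symbolic policy iteration), consider a machine that silently degrades after an exponentially distributed time and a periodic maintenance alarm with a Dirac delay d: each cycle costs a fixed maintenance fee plus a running cost for the time spent degraded, so the long-run average cost of a candidate d has a closed form, and scanning a discretized grid of delays yields an approximately optimal parameter.

        import math

        FAIL_RATE, MAINT_COST, DEGRADED_COST_RATE = 1.0, 1.0, 3.0

        def long_run_average_cost(d):
            # expected time spent degraded in one maintenance cycle of length d:
            # E[(d - X)^+] for an Exp(FAIL_RATE)-distributed degradation time X
            degraded_time = d - (1.0 - math.exp(-FAIL_RATE * d)) / FAIL_RATE
            return (MAINT_COST + DEGRADED_COST_RATE * degraded_time) / d

        grid = [0.05 * k for k in range(1, 201)]          # discretized alarm delays in (0, 10]
        best = min(grid, key=long_run_average_cost)
        print(best, long_run_average_cost(best))          # interior optimum near d = 1.2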

    Maximizing the Conditional Expected Reward for Reaching the Goal

    The paper addresses the problem of computing maximal conditional expected accumulated rewards until reaching a target state (briefly called maximal conditional expectations) in finite-state Markov decision processes where the condition is given as a reachability constraint. Conditional expectations of this type can, e.g., stand for the maximal expected termination time of probabilistic programs with non-determinism, under the condition that the program eventually terminates, or for the worst-case expected penalty to be paid, assuming that at least three deadlines are missed. The main results of the paper are (i) a polynomial-time algorithm to check the finiteness of maximal conditional expectations, (ii) PSPACE-completeness for the threshold problem in acyclic Markov decision processes where the task is to check whether the maximal conditional expectation exceeds a given threshold, (iii) a pseudo-polynomial-time algorithm for the threshold problem in the general (cyclic) case, and (iv) an exponential-time algorithm for computing the maximal conditional expectation and an optimal scheduler. Comment: 103 pages, extended version with appendices of a paper accepted at TACAS 201
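
    For a single fixed scheduler the MDP becomes a Markov chain, and the quantity in question can be estimated directly; the Monte Carlo sketch below (with a made-up chain, and without the maximization over schedulers that the paper addresses) averages the accumulated reward over exactly those runs that reach the goal.

        import random

        # chain[state] = list of (probability, successor); a unit reward is earned per step
        chain = {
            "start": [(0.5, "goal"), (0.3, "start"), (0.2, "fail")],
            "goal": [], "fail": [],
        }
        STEP_REWARD = 1.0

        def run():
            state, total = "start", 0.0
            while chain[state]:                     # loop until an absorbing state is reached
                total += STEP_REWARD
                r, acc, nxt = random.random(), 0.0, None
                for p, succ in chain[state]:
                    acc += p
                    nxt = succ
                    if r <= acc:
                        break
                state = nxt
            return state == "goal", total

        samples = [total for ok, total in (run() for _ in range(100_000)) if ok]
        print(sum(samples) / len(samples))          # estimate of the conditional expectation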

    Value Iteration for Long-run Average Reward in Markov Decision Processes

    Markov decision processes (MDPs) are standard models for probabilistic systems with non-deterministic behaviours. Long-run average rewards provide a mathematically elegant formalism for expressing long-term performance. Value iteration (VI) is one of the simplest and most efficient algorithmic approaches to MDPs with other properties, such as reachability objectives. Unfortunately, a naive extension of VI does not work for MDPs with long-run average rewards, as there is no known stopping criterion. In this work our contributions are threefold. (1) We refute a conjecture related to stopping criteria for MDPs with long-run average rewards. (2) We present two practical algorithms for MDPs with long-run average rewards based on VI. First, we show that a combination of applying VI locally for each maximal end component (MEC) and VI for reachability objectives can provide approximation guarantees. Second, extending the above approach with a simulation-guided on-demand variant of VI, we present an anytime algorithm that is able to deal with very large models. (3) Finally, we present experimental results showing that our methods significantly outperform the standard approaches on several benchmarks.
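
    A toy sketch of the two-step structure in contribution (2), on a hand-built example rather than the paper's algorithm: the end components here are two absorbing states whose gains are simply their self-loop rewards, and the value of a transient state is then obtained by value iteration on the induced weighted-reachability problem, maximizing the expected gain of the end component eventually reached.

        gain = {"mecA": 2.0, "mecB": 5.0}     # long-run average reward inside each end component

        # transient[state][action] = list of (probability, successor)
        transient = {
            "s": {
                "left":  [(1.0, "mecA")],
                "right": [(0.6, "mecB"), (0.4, "mecA")],
            },
        }

        def value(iters=100):
            val = dict(gain)
            val.update({s: 0.0 for s in transient})
            for _ in range(iters):
                for s, acts in transient.items():
                    val[s] = max(
                        sum(p * val[t] for p, t in succ) for succ in acts.values()
                    )
            return val

        print(value()["s"])    # 0.6 * 5 + 0.4 * 2 = 3.8: 'right' beats the sure gain of 2.0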

    Optimizing Performance of Continuous-Time Stochastic Systems using Timeout Synthesis

    We consider a parametric version of fixed-delay continuous-time Markov chains (or, equivalently, deterministic and stochastic Petri nets, DSPNs) where fixed-delay transitions are specified by parameters rather than concrete values. Our goal is to synthesize values of these parameters that, for a given cost function, minimise the expected total cost incurred before reaching a given set of target states. We show that under mild assumptions, optimal values of the parameters can be effectively approximated using a translation to a Markov decision process (MDP) whose actions correspond to discretized values of these parameters.
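
    The sketch below illustrates the translation on a toy model with assumed numbers (not the paper's construction): a job may hang with some probability and otherwise finishes after an exponential service time, while a fixed-delay timeout restarts it at a penalty. Each discretized timeout value becomes an action of a two-state MDP with a one-step expected cost and a success probability, and value iteration over these actions approximates the timeout minimizing the expected total cost until the job is done.

        import math

        RATE, HANG_PROB, RESTART_PENALTY = 1.0, 0.2, 1.0

        def action_parameters(d):
            # one attempt with timeout d: it succeeds iff the job is not hung and finishes
            # before the timeout; the elapsed time is d when hung, min(Exp(RATE), d) otherwise
            p_success = (1.0 - HANG_PROB) * (1.0 - math.exp(-RATE * d))
            elapsed = HANG_PROB * d + (1.0 - HANG_PROB) * (1.0 - math.exp(-RATE * d)) / RATE
            cost = elapsed + RESTART_PENALTY * (1.0 - p_success)
            return p_success, cost

        def synthesize(grid, iters=500):
            actions = [(d,) + action_parameters(d) for d in grid]
            value = 0.0
            for _ in range(iters):                  # value iteration for expected total cost
                value = min(c + (1.0 - p) * value for _, p, c in actions)
            best = min(actions, key=lambda a: a[2] + (1.0 - a[1]) * value)
            return best[0], value

        grid = [0.1 * k for k in range(1, 101)]     # discretized timeout values in (0, 10]
        print(synthesize(grid))                     # approximately optimal timeout and its cost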